Introduction

One Health is a relatively new concept concerned with the integrated health of animals, humans, and the environment. Simply put, it applies ecology and evolution to health and well-being. It carries the idea that all health is interconnected, which is quite perceptible in the pandemic we are living through right now. Keep in mind that we humans exert great evolutionary pressure on plants, animals and, especially, microorganisms. The latter are the most susceptible to this pressure, as can be seen in the widespread antibiotic resistance and the frequent appearance of new superbugs. On top of that, humans already see the problem but do not understand it completely, which is why many scientists are working on this subject.

During one of our scientific meetings, a professor pointed out that without prior information it is extremely difficult to determine the geographic or environmental source of a microorganism. This information is important for understanding, for example, whether bacteria sampled in farms or rivers come into contact with humans.

Furthermore, I hypothesize that this information may be hidden in the bacterial genome. Some bacteria might carry specific traits that indicate where they live, but, as with everything in biology, things are not that simple: the target signal is likely buried under complex interactions between genes that cannot be scrutinized by human logic alone. That is why we have computers! Feeding good data into machine learning models might surface these interactions and, moreover, yield genetic insights about bacterial distribution throughout the environment.

Objectives

* The main objective is to predict a bacterium's origin along one of the dimensions that the nucleotide database gives us.

* Secondarily, to find new insights about which specific genes and specific niches may be worth studying.

References

The current work was not based on any articles or studies. I read some abstracts and searched PubMed, but I was not able to find any interesting reference or similar study; one might exist, though, so further digging is needed.

Current work

–> Everything is being done in R. I started learning it at the beginning of the quarantine, at the same time as taking a bioinformatics course. So I am pretty new at both, but this is what I have managed so far.

Getting accession ID numbers

Reading previously obtained data (accession IDs downloaded from the NCBI Nucleotide platform) from an online spreadsheet:

link <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTgehkctLr41ea7BgkPw-jwXL5poeo0Ce5HRfnL8Ydw__tFeTCI_sxYjkkVtcctErqhNAA2bj6Ans9J/pub?gid=1011092740&single=true&output=csv"

library(tidyverse) # read.csv pipeline, stringr, tidyr

Accession_id <- read.csv(link) %>% as_tibble()

bacteria_data_big <- tibble()

Using rentrez to fetch summary data for each ID, then processing the wanted fields into a tibble.

Regarding UIDs: I had problems because the same UID was sometimes returned for different bacteria, which should not happen, so I used the accession number instead.
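For reference, the `subtype` and `subname` fields of the summary come back as single pipe-delimited strings (variable names in one, values in the other), and splitting on `|` pairs them up. A toy illustration, with made-up field values:

```r
library(stringr)

# Hypothetical subtype/subname pair as returned in a summary record
subtype <- "strain|country|host"
subname <- "K-12|USA|Homo sapiens"

# str_split returns a list; [[1]] extracts the character vector
subtype_fields <- str_split(subtype, "\\|")[[1]]
subname_values <- str_split(subname, "\\|")[[1]]

subtype_fields
# c("strain", "country", "host")
```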

Getting target data

This for loop can sometimes return an error; I still need to think of a way to resolve the problem.

indice_erro <- c() # collects the indices that fail

for (i in seq_along(Accession_id$ID)) {
  print(paste0("ongoing ", i))
  tryCatch({
    accession_id = Accession_id[i,] %>% 
      as.character()
    
    bacteria_summary <- 
      rentrez::entrez_summary("nucleotide", id= accession_id)
    
    bacteria_subtype <- 
      bacteria_summary$subtype %>% 
      str_split("\\|") 
    
    bacteria_subname <- 
      bacteria_summary$subname %>% 
      str_split("\\|")
    
    bacteria_data_lil <- 
      tibble(accession_id, 
             bacteria_subname, 
             bacteria_subtype)
    
    bacteria_data_big <- 
      bind_rows(bacteria_data_big, 
                bacteria_data_lil)
  }, error = function(e) {
    
    print(paste0("Something went wrong at index: ", i))
    
    # `<<-` is needed so the assignment survives outside the handler's scope
    indice_erro <<- c(indice_erro, i)
    
  })
  
  if (i == length(Accession_id$ID)) {print("complete")}
  
}

# Each row now holds a list with the variable names and the values of those
# variables, so first we unlist both, which gives us plain character vectors

From here on, things run smoothly

lasting_tib <- tibble()

for (x in seq_len(nrow(bacteria_data_big))) { # iterate over fetched rows (some IDs may have failed)
  
  print(paste0("ongoing ", x))
  
  list_subtype = bacteria_data_big$bacteria_subtype[[x]] %>% 
    unlist()
  
  list_subname = bacteria_data_big$bacteria_subname[[x]] %>% 
    unlist()
  
  list_accession_id <- bacteria_data_big$accession_id[[x]] 
  
  #Now we expand our tibble, putting all variables and values into two columns 
  
  for (y in 1:length(list_subtype)) {
    
    fleeting_tib <- tibble(ID= list_accession_id, 
                           subtype= list_subtype[y], 
                           subname= list_subname[y])
    
    lasting_tib <- bind_rows(lasting_tib, 
                             fleeting_tib) 
    
    print(paste0(y ," complete" ))
  }
}

write.csv(lasting_tib,"lasting_tibble_ALL.csv")
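The nested loops above grow `lasting_tib` one row at a time, which gets slow for large inputs. As a sketch of a vectorized alternative (on toy data mimicking the same three-column structure), `tidyr::unnest()` can flatten both list-columns in one step:

```r
library(tidyverse)

# Toy stand-in for bacteria_data_big: one row per accession,
# with list-columns holding the split subtype/subname vectors
toy <- tibble(
  accession_id     = c("A1", "A2"),
  bacteria_subtype = list(c("strain", "country"), c("host")),
  bacteria_subname = list(c("K-12", "USA"), c("Homo sapiens"))
)

# unnest() pairs the two list-columns element by element,
# producing one row per (variable, value) pair
flat <- toy %>%
  unnest(c(bacteria_subtype, bacteria_subname)) %>%
  rename(ID = accession_id,
         subtype = bacteria_subtype,
         subname = bacteria_subname)
```

The result has the same `ID`/`subtype`/`subname` shape as `lasting_tib`, without any explicit looping.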

#Because of errors made when information was entered on the NCBI platform, some
#columns can come with missing values, which would cause problems in the following steps.

#Opening one of the bacteria records that had caused problems, I could see that the field
#really was empty, so the problem must come from the program that did the scraping.
#So we must drop these blank values.

lasting_tib %>% 
  mutate_all(na_if, "") %>% #summary of na values 
  is.na() %>%
  summary()

lasting_tib %>% #summary of treated
  mutate_all( na_if, "") %>% 
  drop_na() %>% 
  is.na() %>% 
  summary()

lasting_tib_treated <-
  lasting_tib %>% 
  mutate_all( na_if, "") %>%
  drop_na()



Here I eliminate all the bacteria that could generate a problem.

#finding duplicates

# drop the first column (the row-name column, assuming the tibble was re-read from the CSV)
lasting_tib_treated <- lasting_tib_treated[,2:length(lasting_tib_treated)]

# `bacteria_dataset_lenght` was not defined above; presumably it counts how many
# times each ID/subtype pair occurs, e.g.:
bacteria_dataset_length <- lasting_tib_treated %>% count(ID, subtype)

bacteria_dataset_length %>% summary() 

#Here you can see whether there are any duplicates in the data set.
#If the mean of n equals 1 everything is alright; if not, some bacterium
#shows up more than once, and for pragmatic reasons I drop those.
#Since I found problems only in host, I applied this fix only to it; if the problem exists
#for another variable you must do the same with that variable.

ID2_drop <- 
  lasting_tib_treated %>% 
  filter( subtype == "host") %>% 
  mutate(rn = data.table::rowid(ID)) %>%
  filter(rn >= 2) %>% select(1) %>% unlist()

lasting_tib_treated <- lasting_tib_treated %>% filter(!ID %in% ID2_drop)

# At last, using pivot_wider we generate new columns for the variables and fill 
# them with the value or NA


bacteria_dataset <- 
  lasting_tib_treated %>% 
  pivot_wider(names_from = subtype,
              values_from = subname)


write.csv(bacteria_dataset, "bacteria_dataset_ALL.csv" )
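As a minimal illustration of what this reshaping does (toy values, not real records):

```r
library(tidyverse)

# Long format: one row per (ID, variable, value) triple
long <- tibble(
  ID      = c("A1", "A1", "A2"),
  subtype = c("country", "host", "country"),
  subname = c("Brazil", "Homo sapiens", "Japan")
)

# Wide format: one row per ID, one column per variable;
# combinations that never occurred (A2's host) become NA
wide <- long %>%
  pivot_wider(names_from = subtype, values_from = subname)
```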

Categorization of target data

The target data is still pretty crude: there are too many levels, and these levels must be grouped into a few categories.

#Retrieving data

analysis_data <-  
  read.csv("bacteria_dataset_ALL.csv") 

#Analysis of data 
analysis_data %>% 
  arrange() %>% 
  head()

The only fields that seemed worth studying were host, isolation source, and country. Therefore, next we look at some data about them.

analysis_data %>% 
  colnames()
##  [1] "X"                  "ID"                 "strain"            
##  [4] "sub_strain"         "country"            "isolation_source"  
##  [7] "lat_lon"            "collection_date"    "note"              
## [10] "host"               "serotype"           "pathovar"          
## [13] "collected_by"       "serovar"            "culture_collection"
## [16] "isolate"            "genotype"           "old_name"          
## [19] "serogroup"          "subtype"            "subgroup"          
## [22] "old_lineage"        "sub_species"        "type_material"     
## [25] "identified_by"      "chromosome"         "phenotype"         
## [28] "altitude"           "specimen_voucher"   "dev_stage"

Next I made a small treatment and built tibbles that count the frequency of the values for the three variables. This should help categorize the data later.

  • Countries
country_levels <- 
  analysis_data$country %>%
  as.factor() %>%
  levels()

country_levels %>% 
  sample() %>% 
  head()
## [1] "Brazil: Jardinopolis"                
## [2] "Dominican Republic: Santo Domingo"   
## [3] "Japan: Yokosuka"                     
## [4] "France: Libourne"                    
## [5] "USA: Lahey Hospital & Medical Center"
## [6] "USA: MA"
analysis_data$country %>% 
  str_remove_all("\\:.*") %>% 
  as_tibble() %>%
  na.omit() %>% 
  count(value) %>% 
  arrange(desc(n)) %>% 
  head()
analysis_data$country %>% 
  str_remove_all("\\:.*") %>%
  na.omit() %>%
  as.factor() %>% 
  levels()
##   [1] "Afghanistan"           "Algeria"               "Antarctica"           
##   [4] "Argentina"             "Australia"             "Austria"              
##   [7] "Bahrain"               "Bangladesh"            "Belgium"              
##  [10] "Bolivia"               "Brazil"                "Brunei"               
##  [13] "Bulgaria"              "Burkina Faso"          "Cambodia"             
##  [16] "Cameroon"              "Canada"                "Chile"                
##  [19] "China"                 "Colombia"              "Costa Rica"           
##  [22] "Croatia"               "Cuba"                  "Czech Republic"       
##  [25] "Denmark"               "Dominican Republic"    "Ecuador"              
##  [28] "Egypt"                 "Estonia"               "Finland"              
##  [31] "France"                "Gabon"                 "Gambia"               
##  [34] "Georgia"               "Germany"               "Ghana"                
##  [37] "Greece"                "Guam"                  "Guinea"               
##  [40] "Guinea-Bissau"         "Guyana"                "Hong Kong"            
##  [43] "Hungary"               "India"                 "Indonesia"            
##  [46] "Iran"                  "Ireland"               "Israel"               
##  [49] "Italy"                 "Japan"                 "Jordan"               
##  [52] "Kazakhstan"            "Kenya"                 "Korea"                
##  [55] "Kosovo"                "Laos"                  "Latvia"               
##  [58] "Lebanon"               "Lithuania"             "Malawi"               
##  [61] "Malaysia"              "Mali"                  "Mexico"               
##  [64] "Morocco"               "Mozambique"            "Myanmar"              
##  [67] "Netherlands"           "New Zealand"           "Niger"                
##  [70] "Nigeria"               "Norway"                "Pakistan"             
##  [73] "Panama"                "Papua New Guinea"      "Paraguay"             
##  [76] "Peru"                  "Philippines"           "Poland"               
##  [79] "Portugal"              "Puerto Rico"           "Qatar"                
##  [82] "Republic of the Congo" "Romania"               "Russia"               
##  [85] "Rwanda"                "Saudi Arabia"          "Serbia"               
##  [88] "Singapore"             "Slovakia"              "Slovenia"             
##  [91] "South Africa"          "South Korea"           "Spain"                
##  [94] "Sri Lanka"             "Sudan"                 "Sweden"               
##  [97] "Switzerland"           "Taiwan"                "Tajikistan"           
## [100] "Tanzania"              "Thailand"              "Togo"                 
## [103] "Tonga"                 "Tunisia"               "Turkey"               
## [106] "Ukraine"               "United Arab Emirates"  "United Kingdom"       
## [109] "Uruguay"               "USA"                   "Viet Nam"             
## [112] "Zaire"                 "Zambia"
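The `str_remove_all("\\:.*")` step above keeps only the country name before the colon, collapsing city- or site-level entries into one level per country. A quick check on sample values:

```r
library(stringr)

raw <- c("Brazil: Jardinopolis", "USA: MA", "Japan")

# Remove the colon and everything after it
cleaned <- str_remove_all(raw, "\\:.*")

cleaned
# c("Brazil", "USA", "Japan")
```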
  • Host
hosts_levels <-
  analysis_data$host %>% 
  as.factor() %>%
  levels()

hosts_levels %>% 
  sample() %>% 
  head()
## [1] "swine"             "Cygnus buccinator" "okapi"            
## [4] "Cow"               "Bos indicus"       "Rat"
analysis_data$host %>% 
  na.omit() %>% 
  as_tibble() %>% 
  count(value) %>% 
  arrange(desc(n)) %>% 
  head()
  • Isolation source
isolation_source_levels <- 
  analysis_data$isolation_source %>% 
  as.factor() %>% 
  levels()

isolation_source_levels %>% 
  sample() %>% 
  head()
## [1] "CGC"                                      
## [2] "Urine sample from patient with Septicemia"
## [3] "poultry fecal material"                   
## [4] "Sanger Centre via Imperial College"       
## [5] "slaughterhouse"                           
## [6] "cow small intestine"
analysis_data$isolation_source %>% 
  str_replace_all("[sS]tool.*", "Stool") %>%
  str_replace_all("[Ff]ece.*", "Feces") %>% 
  str_replace_all("wastewater.*", "wastewater") %>% 
  na.omit() %>%
  as_tibble() %>% 
  count(value) %>%
  arrange(desc(n)) %>% 
  head()
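The regex substitutions above collapse free-text variants onto a few canonical labels. On sample strings (made up for illustration):

```r
library(stringr)

src <- c("stool sample", "Feces of pig", "wastewater treatment plant", "Stool")

# Each pattern matches from its keyword to the end of the string,
# so trailing detail is folded into one canonical label
norm <- src %>%
  str_replace_all("[sS]tool.*", "Stool") %>%
  str_replace_all("[Ff]ece.*", "Feces") %>%
  str_replace_all("wastewater.*", "wastewater")

norm
# c("Stool", "Feces", "wastewater", "Stool")
```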

Getting test data

From a genetic standpoint, this data could capture many aspects of bacterial genetics:

  • presence or absence of all or certain genes;
  • nucleotide sequences of certain genes;
  • frequency of genes;
  • ratio of nucleotide bases.

At first glance, the simplest option is presence or absence of all genes, so the following steps focus on getting all genes from annotated genomes in the NCBI database.

bacteria_id <- analysis_data$ID

big_gene_tib <- tibble() # empty tibble to accumulate results
erro_indice <- tibble()

#Loop that, based on each bacterium's ID, parses the GenBank file, selects the genes
#and puts them in a tibble. To parse the genome, genbankr from Bioconductor is used.

tib_genome <- #function that produces a tibble with gene information plus the target
  function(genome) { #information that was found using the rentrez package
    data_2bind <-  analysis_data %>% 
      filter(
        ID == bacteria_id[i]) %>%
      select(
        sub_strain, 
        country, 
        isolation_source, 
        host)
    
    tibble( 
      data_2bind,
      bacteria_ID = bacteria_id[i], 
      type = genome$type,
      genes_composition = genome$gene,
      locus_tag =  genome$locus_tag,
      loctype = genome$loctype,
      gene_ID = genome$gene_id
    )
  }

checking_true <- 0
checking_false <- 0
for (i in seq_along(bacteria_id)) {
  
  getting_genes <- NULL # reset so a failed fetch cannot silently reuse the previous genome's genes
  
  print(paste0( i ,' accessing genome ', bacteria_id[i]))
  
  tryCatch({
    
    gene_accession <- 
      genbankr::GBAccession(bacteria_id[i])
    
    parsing_genome <- 
      genbankr::readGenBank(gene_accession) #slower line to be processed 
    
    getting_genes  <- 
      parsing_genome %>%
      genbankr::genes()
    
  }, error = function(e){
    
    print(paste0("Something went wrong at index: ", i))
    
    erro_indice_lil <- bacteria_id[i] %>% as_tibble()
    erro_indice <<- bind_rows(erro_indice, erro_indice_lil) # `<<-` so it persists outside the handler
    
  })
  print(paste0("seeing if there's gene annotation in ",  bacteria_id[i]))
  
  #some genomes do not have annotation; therefore there are two options: the first
  #is to keep only those that have it, the other is to make the annotations ourselves
  #based on a database. The latter would take a really long time of study and processing.
  
  if (length(getting_genes) >= 1 ) { 
    
    print(paste0(bacteria_id[i], " checked"))
    
    
    gene_data <- #here I select eligible variables and drop na values in genes  
      tib_genome(genome = getting_genes ) %>% 
      select(bacteria_ID, 
             country,
             host,
             isolation_source,
             genes_composition) %>% 
      mutate(rn = data.table::rowid(genes_composition)) %>%
      drop_na(genes_composition)
    
    
    gene_data <-  #here I eliminate all genes that are repeated
      gene_data[!duplicated((gene_data$genes_composition)),]
    
    checking_true <-
      checking_true + 1
    
    big_gene_tib <- 
      bind_rows(big_gene_tib,
                gene_data)
    
  } else {
    
    print(paste0(bacteria_id[i] ," ", length(getting_genes) >= 1))
    
    checking_false <- checking_false + 1
  }
} 

#Checking that the data was collected correctly is important, so I should work out
#how to do that

count_TF <- tibble(checking_false, checking_true)

expected_T <-
  big_gene_tib$bacteria_ID %>%
  unique() %>%
  length()

expected_T == count_TF$checking_true

#saving brute data

write.csv(big_gene_tib, "big_gene_tib_ALL.csv", row.names = F)
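Since the plan is to use presence/absence of genes as features, here is a sketch (on toy data, not the real `big_gene_tib`) of how the long gene table could later be turned into a binary feature matrix for ML with `pivot_wider()`:

```r
library(tidyverse)

# Toy stand-in for big_gene_tib: one row per (genome, gene) pair
toy_genes <- tibble(
  bacteria_ID       = c("A1", "A1", "A2"),
  genes_composition = c("recA", "gyrB", "recA")
)

# One row per genome, one column per gene: 1 = present, 0 = absent
feature_matrix <- toy_genes %>%
  mutate(present = 1L) %>%
  pivot_wider(names_from  = genes_composition,
              values_from = present,
              values_fill = 0L)
```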

This step is a complicated one; many problems happened throughout the development of the script.

  • Many genomes do not have annotations, so they are discarded.
  • A big amount have problems when they are parsed. This does not raise an error, but genbankr gives us a warning. Here is an example:
[1] "1167 accessing genome NZ_AKVX01000001.1"
Translation product seems to be missing for 53 of 4294 CDS annotations. Setting to ''
No exons read from genbank file. Assuming sections of CDS are full exons
No transcript features (mRNA) found, using spans of CDSs
[1] "seeing if there's gene annotation in NZ_AKVX01000001.1"
[1] "NZ_AKVX01000001.1 checked"
* No transcript features (mRNA) found, using spans of CDSs
  • This error also happens repeatedly:
Error in entrez_check(response) : 
HTTP failure: 502, bad gateway. This error code is often returned when trying 
to download many records in a single request.  
Try using web history as described in the rentrez tutorial
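The 502 message itself suggests the fix: use NCBI web histories instead of one request per ID. A hedged sketch of that approach with rentrez (the search term is an assumption for illustration; the per-chunk processing is elided):

```r
library(rentrez)

# Search once and ask NCBI to keep the matching IDs server-side
search_res <- entrez_search(db = "nucleotide",
                            term = "Escherichia coli[ORGN]", # hypothetical query
                            use_history = TRUE)

# Then page through summaries in modest chunks instead of one ID per request
for (start in seq(0, search_res$count - 1, by = 50)) {
  chunk <- entrez_summary(db = "nucleotide",
                          web_history = search_res$web_history,
                          retstart = start, retmax = 50)
  # ...process `chunk` as in the loop above...
  Sys.sleep(0.4) # stay under NCBI's request rate limit
}
```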

Anyway, I was able to build this tibble with all the test variables and target variables.

## [1] "Number of test variables"
## [1] 6455
## [1] "Number of bacteria"
## [1] 492

Further, keeping in mind that some annotations were lost in parsing, I wanted to know the number of annotations per genome. How many annotated genes do most E. coli have?

big_gene_tib %>% 
  group_by(bacteria_ID) %>% 
  count() %>% 
  arrange(desc(n)) %>% 
  head()

-> The most thoroughly annotated genome has 4566 genes

big_gene_tib %>% 
  group_by(bacteria_ID) %>% 
  count() %>% 
  arrange(n) %>%
  head()

-> The least annotated genome has 26 genes

Counting_genomes <- 
  big_gene_tib %>% 
  group_by(bacteria_ID) %>% 
  count() %>% 
  arrange(desc(n))

Counting_genomes$n %>% 
  as.numeric() %>% 
  median()
## [1] 2795
Counting_genomes$n %>% 
  as.numeric() %>% 
  mean()
## [1] 2513.25
box <- Counting_genomes %>% 
  mutate(GENES_annotated = "GENES_annotated") %>% 
  ggplot(aes(y= n, x= GENES_annotated ))+
  geom_boxplot() +
  theme_minimal()

graph <- Counting_genomes %>% ggplot(aes(x = n)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 15, alpha = .7) +
  geom_density(fill = "blue", alpha = .2) +
  theme_minimal() 

plotly::ggplotly(graph)
plotly::ggplotly(box)

Next steps

  • Study gene annotation and the possible problems in this process;
  • Look up the mean exome size of bacteria and check whether the numbers found here match;
  • Study statistics for further processing and analysis of the data;
  • See which genes are annotated in all genomes;
  • Study the ML models that could be used in the analysis;
  • Further data prospection: only 1500 genomes were taken, but there are many more;
  • Begin to study the literature;
  • Learn how to manage errors and understand them.
